
feat: Ignore DPUs, so that they can be externally managed #85

Open
vinodchitraliNVIDIA wants to merge 1 commit into NVIDIA:main from vinodchitraliNVIDIA:vc/nic

Conversation

@vinodchitraliNVIDIA commented Jan 27, 2026

Description

This covers the case where DPU(s) physically exist on the host. They are assigned a machine interface, but they are not part of the Managed Host. The PR skips DPU configuration and forces the host to boot from its NIC.
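A minimal sketch of the behavior the description implies. All names here (`ExplorerConfig`, `use_onboard_nic`, `ingest_host`) are illustrative assumptions, not the actual carbide API:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical config; `use_onboard_nic` mirrors the flag this PR discusses.
struct ExplorerConfig {
    use_onboard_nic: AtomicBool, // when true, discovered DPUs are ignored
}

#[derive(Debug, PartialEq)]
enum HostIngest {
    WithManagedDpus(usize), // DPUs configured as part of the Managed Host
    NicBootOnly,            // DPU configuration skipped; host boots from NIC
}

fn ingest_host(cfg: &ExplorerConfig, discovered_dpus: usize) -> HostIngest {
    if cfg.use_onboard_nic.load(Ordering::Relaxed) {
        // DPUs may physically exist and hold a machine interface, but they
        // are left to an external manager and kept out of the Managed
        // Host's lifecycle.
        HostIngest::NicBootOnly
    } else {
        HostIngest::WithManagedDpus(discovered_dpus)
    }
}
```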

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@vinodchitraliNVIDIA requested a review from a team as a code owner on January 27, 2026 06:08
copy-pr-bot commented Jan 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@vinodchitraliNVIDIA force-pushed the vc/nic branch 12 times, most recently from 2a94f1f to 68e2edc on January 27, 2026 12:44
@github-actions

🛡️ Vulnerability Scan

🚨 Found 30 vulnerability(ies)

Severity Breakdown:

  • 🔴 Critical/High: 30
  • 🟡 Medium: 0
  • 🔵 Low/Info: 0
📋 Top Vulnerabilities
  • GHSA-2cgv-28vr-rv6j: Package: libcrux-intrinsics
    Installed Version: 0.0.3
    Vulnerability GHSA-2cgv-28vr-rv6j
    Severity: HIGH
    Fixed Version: 0.0.4
    Link: GHSA-2cgv-28vr-rv6j (Cargo.lock)
  • DS002: Artifact: crates/api/Dockerfile
    Type: dockerfile
    Vulnerability DS002
    Severity: HIGH
    Message: Specify at least 1 USER command in Dockerfile with non-root user as argument
    Link: DS002 (crates/api/Dockerfile)
  • DS002: Artifact: crates/dhcp/Dockerfile
    Type: dockerfile
    Vulnerability DS002
    Severity: HIGH
    Message: Specify at least 1 USER command in Dockerfile with non-root user as argument
    Link: DS002 (crates/dhcp/Dockerfile)
  • DS002: Artifact: crates/dns/Dockerfile
    Type: dockerfile
    Vulnerability DS002
    Severity: HIGH
    Message: Specify at least 1 USER command in Dockerfile with non-root user as argument
    Link: DS002 (crates/dns/Dockerfile)
  • KSV014: Artifact: deploy/carbide-base/api/deployment.yaml
    Type: kubernetes
    Vulnerability KSV014
    Severity: HIGH
    Message: Container 'carbide-api' of Deployment 'carbide-api' should set 'securityContext.readOnlyRootFilesystem' to true
    Link: KSV014 (deploy/carbide-base/api/deployment.yaml)
  • KSV118: Artifact: deploy/carbide-base/api/deployment.yaml
    Type: kubernetes
    Vulnerability KSV118
    Severity: HIGH
    Message: deployment carbide-api in default namespace is using the default security context, which allows root privileges
    Link: KSV118 (deploy/carbide-base/api/deployment.yaml)
  • KSV014: Artifact: deploy/carbide-base/api/migration.yaml
    Type: kubernetes
    Vulnerability KSV014
    Severity: HIGH
    Message: Container 'carbide-api-migrate' of Job 'carbide-api-migrate' should set 'securityContext.readOnlyRootFilesystem' to true
    Link: KSV014 (deploy/carbide-base/api/migration.yaml)
  • KSV118: Artifact: deploy/carbide-base/api/migration.yaml
    Type: kubernetes
    Vulnerability KSV118
    Severity: HIGH
    Message: container carbide-api-migrate in default namespace is using the default security context
    Link: KSV118 (deploy/carbide-base/api/migration.yaml)
  • KSV118: Artifact: deploy/carbide-base/api/migration.yaml
    Type: kubernetes
    Vulnerability KSV118
    Severity: HIGH
    Message: job carbide-api-migrate in default namespace is using the default security context, which allows root privileges
    Link: KSV118 (deploy/carbide-base/api/migration.yaml)
  • KSV014: Artifact: deploy/carbide-base/dhcp/deployment.yaml
    Type: kubernetes
    Vulnerability KSV014
    Severity: HIGH
    Message: Container 'carbide-dhcp' of Deployment 'carbide-dhcp' should set 'securityContext.readOnlyRootFilesystem' to true
    Link: KSV014 (deploy/carbide-base/dhcp/deployment.yaml)

💡 Note: Enable GitHub Advanced Security to see full details in the Security tab.
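For reference, the DS002 and KSV014/KSV118 findings above have standard remediations: add a non-root `USER` directive to each Dockerfile, and set an explicit security context on each container. A sketch of the Kubernetes side follows; the field names are the standard pod-spec ones, but where exactly they belong in the `deploy/carbide-base/*` manifests is an assumption:

```yaml
# Sketch: per-container securityContext addressing KSV014/KSV118.
# Placement within the carbide manifests is an assumption.
containers:
  - name: carbide-api
    securityContext:
      runAsNonRoot: true            # KSV118: avoid the default, root-capable context
      readOnlyRootFilesystem: true  # KSV014
      allowPrivilegeEscalation: false
```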

@github-actions

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

@github-actions

🛡️ CodeQL Analysis

✅ No security issues found!

💡 Note: Enable GitHub Advanced Security to see full details in the Security tab.

@kensimon (Contributor)

@vinodchitraliNVIDIA could you make sure to write up a description of your changes? This is hard to review because I have zero context for what this change is for.

Particularly, why is this change needed? When doing zero-dpu testing we tested with hosts that had onboard NICs and it worked fine. I'm confused why there needs to be any changes here (I don't doubt there does, but I just want to understand what's lacking in the current code.)

@vinodchitraliNVIDIA (Author) commented Jan 27, 2026

> @vinodchitraliNVIDIA could you make sure to write up a description of your changes? This is hard to review because I have zero context for what this change is for.
>
> Particularly, why is this change needed? When doing zero-dpu testing we tested with hosts that had onboard NICs and it worked fine. I'm confused why there needs to be any changes here (I don't doubt there does, but I just want to understand what's lacking in the current code.)

In the zero-DPU case, no DPU physically exists. If the allow_zero_dpu_hosts flag is set, then machines are created.

In our use case, DPU(s) physically exist on the host. They are assigned a machine interface, but they are not part of the Managed Host because DPU configuration is skipped. The use case looks more or less the same as the latter case: DPU(s) may or may not exist.

  • I will update the commit description.

(The github-actions bots then posted identical results to the scans above: TruffleHog clean, the same 30 Critical/High vulnerabilities, and no CodeQL issues.)

@Matthias247 (Contributor)

We should find a way to support the use case without building yet another feature. I'm concerned that even the zero-dpu path has no docs and has barely seen any testing. This has even less (no unit tests), so it's very likely to just be forgotten and break.

Any chance we can get this just aligned with regular zero-dpu? E.g. remove the DPUs from the target hosts, or somehow set them in NIC mode and then just use the NIC that the host boots from?

I also feel like this change might just be the very beginning of supporting such hosts: We'd also need to look at the inventory path, SKUs and SKU validation, Software updates, etc.

@kensimon (Contributor)

> Our usecase DPU(s) physically exists host. They are assigned with machine interface, but they are not part Managed Host by skipping DPU configaration. Use case looks more or less same with later case DPU(s) may or may not exists

@vinodchitraliNVIDIA do you mean the DPUs are in NIC mode? Because the zero-dpu case already covers this (if we discover a host and its DPU is in nic-mode, it's treated as a zero-DPU host. We even have integration tests for this.)

Or, do you mean these are DPF-managed hosts or something else, where we see a DPU, and it's not in NIC mode, but carbide opts to not manage the DPU in favor of something else managing it? If so, it's probably worth (a) spelling that out in the PR description, and (b) renaming some of the things here to indicate "externally managed DPU" instead of "onboard NIC"

@vinodchitraliNVIDIA (Author)

> Our usecase DPU(s) physically exists host. They are assigned with machine interface, but they are not part Managed Host by skipping DPU configaration. Use case looks more or less same with later case DPU(s) may or may not exists
>
> @vinodchitraliNVIDIA do you mean the DPU's are in NIC mode? Because the zero-dpu case already covers this (it we discover a host and its DPU is in nic-mode, it's treated as a zero-DPU host. We even have integration tests for this.)
>
> Or, do you mean these are DPF-managed hosts or something else, where we see a DPU, and it's not in NIC mode, but carbide opts to not manage the DPU in favor of something else managing it? If so, it's probably worth (a) spelling that out in the PR description, and (b) renaming some of the things here to indicate "externally managed DPU" instead of "onboard NIC"

DPUs can be in any mode. "Externally managed DPU": let me think that through.

Initially I used allow_zero_dpu_hosts, but the managed_host has discovered DPUs, so the zero-DPU code path is never hit. This is because the code assumes that no DPU will be associated with the explored host.

@vinodchitraliNVIDIA (Author)

> we should find a way to support the use-case without building yet another feature. I'm concerned that even the zero-dpu path has no docs and barely seen any testing. This has even less (no unit-tests), so its very likely just to be forgotten and break.
>
> Any chance we can get this just aligned with regular zero-dpu? E.g. remove the DPUs from the target hosts, or somehow set them in NIC mode and then just use the NIC that the host boots from?
>
> I also feel like this change might just be the very beginning of supporting such hosts: We'd also need to look at the inventory path, SKUs and SKU validation, Software updates, etc.

The current zero-DPU code path assumes that the host will not have an attached DPU; @kensimon, correct me if I am wrong. In our case, DPUs will be attached to the host and carbide DHCP will assign them IPs, but they are not configured in the managed host. Let me add a few test cases.

@vinodchitraliNVIDIA changed the title from "feat: creating managed host for onboard nic hosts" to "feat: Ignore DPUs, so that they can be externally managed" on Jan 29, 2026
This covers the case where DPU(s) physically exist on the host. They are assigned a machine interface, but they are not part of the Managed Host. The PR skips DPU configuration.
This allows DPUs to stay outside the machine's lifecycle workflow.

Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>
@vinodchitraliNVIDIA (Author)

@ajf @kensimon please review the code

@kensimon (Contributor)

> current ZERO DPU code path assumes that host will not have attached DPU, @kensimon correct me if i am wrong. Our case DPUs will be attached host, carbide DHCP will assign IP but they are not configured in managed host. Let me add few test cases

@vinodchitraliNVIDIA this is not the case, systems can have DPU's in what is called "NIC mode", which means they show up as a "dumb NIC" on the device and are not something we put any of our code on. This is a BIOS setting you have to set on the system. If it's set, we don't count it as a "DPU" on the device, it's just a regular NIC. So the zero-DPU code paths apply if the only DPU's on the host are NIC-mode.

But a NIC-mode DPU means nobody else can put code on it either... so putting a DPU in NIC mode means it can't be DPF-managed either.
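The distinction kensimon describes can be modeled in a few lines. This is purely illustrative (these are not the real carbide types): a DPU in "NIC mode" is treated as a plain NIC, so it does not count toward the host's DPU total, and the zero-DPU paths apply:

```rust
// Hypothetical model of NIC-mode vs. fully-managed DPUs.
#[derive(Clone, Copy, PartialEq)]
enum DiscoveredDevice {
    Dpu,     // full DPU: carbide (or DPF) could put code on it
    NicMode, // BIOS-configured "dumb NIC": nobody can manage it as a DPU
}

// Count only devices that behave as real DPUs.
fn dpu_count(devices: &[DiscoveredDevice]) -> usize {
    devices
        .iter()
        .filter(|d| **d == DiscoveredDevice::Dpu)
        .count()
}

// A host whose only DPUs are NIC-mode is handled by the zero-DPU paths.
fn treat_as_zero_dpu_host(devices: &[DiscoveredDevice]) -> bool {
    dpu_count(devices) == 0
}
```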

```rust
return Ok(false);
}
tracing::info!("Created managed_host with zero DPUs");
} else if self.config.use_onboard_nic.load(Ordering::Relaxed) {
```
(Contributor, inline review comment on the diff above:)
I still don't understand what this change is supposed to do. If we've gotten to this point in site_explorer, we've seen no DPU's on the host, and if zero-DPU configuration is allowed, we already ingest it with no DPU's.

Why do we need another config setting called use_onboard_nic, and another function create_onboard_nic_machine? What do these do differently from the zero-dpu path?

@vinodchitraliNVIDIA (Author) commented Feb 3, 2026
@kensimon is there any known bug? A couple of months ago I tried the zero-DPU flag, and it didn't work: the GB200 machine has multiple MAC addresses, and I faced an issue there. Also, the DPU list in the managed host is not empty.

(Contributor, inline review reply:)

Not that I know of? If there is a bug that you found, we should fix it. I don't think we need a fully separate code path for GB200s when we can just fix the current one.

If you have an issue with GB200s in zero-dpu mode, could you file an nvbug and assign it to me? I need logs/details/reproduction steps if possible.

@ajf (Collaborator) commented Feb 4, 2026

@vinodchitraliNVIDIA can you schedule a meeting to explain how this is different from the Zero-DPU code path? Just 15 minutes to walk through the issues you're running into. Zero-DPU should ideally work with any type of NIC on the machine.

@vinodchitraliNVIDIA (Author)

> @vinodchitraliNVIDIA can you schedule a meeting for how this is different from the Zero-DPU code path? Just like 15 minutes to explain what issues you're running into. Zero-DPU ideally should work with any type of NIC on the machine.

Sure, will do. Let me set up an environment with zero DPU using the latest code from GitHub.
